{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "aba4346f",
   "metadata": {},
   "source": [
    "# Document level access in Azure AI Search using the indexer pull APIs\n",
    "\n",
    "In Azure AI Search, you can use an indexer to pull content into a search index for indexing. This notebook shows you how index blobs that have access control lists (ACLs) in Azure Storage Data Lake Storage (ADLS) Gen2, and then query the index to return only those results that the user is authorized to view.\n",
    "\n",
    "The security principal behind the query access token determines the \"user\". The ACLs on folders and files determine whether the user is authorized to view the content, and that metadata is pulled into the index along with other document content. Internally at query time, the search engine filters out any documents that aren't associated with the object ID.\n",
    "\n",
    "This feature is currently in preview.\n",
    "\n",
    "For an alternative approaching using push APIs to index any data, see [Quickstart-Document-Permissions-Push-API](../Quickstart-Document-Permissions-Push-API/document-permissions-push-api.ipynb).\n",
    "\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "+ Azure AI Search, basic tier or higher, with a [system-assigned managed identity](https://learn.microsoft.com/azure/search/search-howto-managed-identities-data-sources) and [role-based access control](https://learn.microsoft.com/azure/search/search-security-enable-roles).\n",
    "\n",
    "+ Azure Storage, general purpose account, with a [hierarchical namespace](https://learn.microsoft.com/azure/storage/blobs/create-data-lake-storage-account).\n",
    "\n",
    "+ Folders and files, where each file has an [access control list specified](https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-access-control). We recommend group IDs.\n",
    "\n",
    "## Permissions\n",
    "\n",
    "This walkthrough uses Microsoft Entra ID authentication and authorization.\n",
    "\n",
    "+ On Azure Storage, **Storage Blob Data Reader** permissions are required for both the search service identity and for your user account since you are testing locally. You also need **Storage Blob Data Contributor** because the sample includes code for creating and configuring a container and its contents.\n",
    "\n",
    "+ On Azure AI Search, assign yourself **Search Service Contributor**, **Search Index Data Contributor**, and **Search Index Data Reader** permissions to create objects and run queries. For more information, see [Connect to Azure AI Search using roles](https://learn.microsoft.com/azure/search/search-security-rbac) and [Quickstart: Connect without keys for local testing](https://learn.microsoft.com/azure/search/search-get-started-rbac).\n",
    "\n",
    "## Limitations\n",
    "\n",
    "+ Parsing indexer options aren't currently supported. There's no support for CSV, JSON, or Markdown parsing."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f445040a",
   "metadata": {},
   "source": [
    "## Set up connections\n",
    "\n",
    "Save the `sample.env` file as `.env` and then modify the environment variables to use your Azure endpoints. All variables must be specified.\n",
    "\n",
    "You need endpoints for:\n",
    "\n",
    "+ Azure AI Search\n",
    "+ Azure Storage\n",
    "\n",
    "For Azure AI Search, find the endpoint in the [Azure portal](https://portal.azure.com), in the **Essentials** section of the Overview page.\n",
    "\n",
    "For Azure Storage, follow the guidance in [Get storage account configuration information](https://learn.microsoft.com/azure/storage/common/storage-account-get-info) to specify all of the variables in the `.env` file. \n",
    "\n",
    "\n",
    "## Load Connections\n",
    "\n",
    "We recommend creating a virtual environment to run this sample code. In Visual Studio Code, open the control palette (ctrl-shift-p) to create an environment. This notebook was tested on Python 3.10.\n",
    "\n",
    "Once the environment is created, load the environment variables to set up connections and object names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b40bb5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dotenv import load_dotenv\n",
    "from azure.identity import DefaultAzureCredential, get_bearer_token_provider\n",
    "import os\n",
    "\n",
    "load_dotenv(override=True) # take environment variables from .env.\n",
    "\n",
    "# The following variables from your .env file are used in this notebook\n",
    "endpoint = os.environ[\"AZURE_SEARCH_ENDPOINT\"]\n",
    "credential = DefaultAzureCredential()\n",
    "index_name = os.getenv(\"AZURE_SEARCH_INDEX\", \"document-permissions-indexer-idx\")\n",
    "indexer_name = os.getenv(\"AZURE_SEARCH_INDEXER\", \"document-permissions-indexer-idxr\")\n",
    "datasource_name = os.getenv(\"AZURE_SEARCH_DATASOURCE\", \"document-permissions-indexer-ds\")\n",
    "adls_gen2_account_name = os.getenv(\"AZURE_STORAGE_ACCOUNT_NAME\")\n",
    "adls_gen2_container_name = os.getenv(\"AZURE_STORAGE_CONTAINER_NAME\")\n",
    "adls_gen2_connection_string = os.environ[\"AZURE_STORAGE_CONNECTION_STRING\"]\n",
    "adls_gen2_resource_id = os.environ[\"AZURE_STORAGE_RESOURCE_ID\"]\n",
    "token_provider = get_bearer_token_provider(credential, \"https://search.azure.com/.default\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d46b940",
   "metadata": {},
   "source": [
    "## Create an index\n",
    "\n",
    "The search index must include fields for your content and for permission metadata. Assign the new permission filter option to a string field and make sure the field is filterable. The search engine builds the filter internally at query time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f981cad",
   "metadata": {},
   "outputs": [],
   "source": [
    "from azure.search.documents.indexes.models import SearchField, SearchIndex, PermissionFilter, SearchIndexPermissionFilterOption\n",
    "from azure.search.documents.indexes import SearchIndexClient\n",
    "\n",
    "index_client = SearchIndexClient(endpoint=endpoint, credential=credential)\n",
    "index = SearchIndex(\n",
    "    name=index_name,\n",
    "    fields=[\n",
    "        SearchField(name=\"id\", type=\"Edm.String\", key=True, filterable=True, sortable=True),\n",
    "        SearchField(name=\"content\", type=\"Edm.String\", searchable=True, filterable=False, sortable=False),\n",
    "        SearchField(name=\"oids\", type=\"Collection(Edm.String)\", filterable=True, permission_filter=PermissionFilter.USER_IDS),\n",
    "        SearchField(name=\"groups\", type=\"Collection(Edm.String)\", filterable=True, permission_filter=PermissionFilter.GROUP_IDS),\n",
    "        SearchField(name=\"metadata_storage_path\", type=\"Edm.String\", searchable=True),\n",
    "        SearchField(name=\"metadata_storage_name\", type=\"Edm.String\", searchable=True)\n",
    "    ],\n",
    "    permission_filter_option=SearchIndexPermissionFilterOption.ENABLED\n",
    ")\n",
    "\n",
    "index_client.create_or_update_index(index=index)\n",
    "print(f\"Index '{index_name}' created with permission filter option enabled.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b8945a2",
   "metadata": {},
   "source": [
    "## Create a data source\n",
    "\n",
    "Set the `IndexerPermissionOption` so that the indexer knows to retrieve the permission metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b25aaf7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from azure.search.documents.indexes.models import SearchIndexerDataSourceConnection, SearchIndexerDataSourceType, IndexerPermissionOption, SearchIndexerDataContainer, DataSourceCredentials\n",
    "from azure.search.documents.indexes import SearchIndexerClient\n",
    "indexer_client = SearchIndexerClient(endpoint=endpoint, credential=credential)\n",
    "datasource = SearchIndexerDataSourceConnection(\n",
    "    name=datasource_name,\n",
    "    type=SearchIndexerDataSourceType.ADLS_GEN2,\n",
    "    connection_string=f\"ResourceId={adls_gen2_resource_id};\",\n",
    "    container=SearchIndexerDataContainer(name=adls_gen2_container_name),\n",
    "    indexer_permission_options=[IndexerPermissionOption.GROUP_IDS]\n",
    ")\n",
    "\n",
    "indexer_client.create_or_update_data_source_connection(datasource)\n",
    "print(f\"Datasource '{datasource_name}' created with permission filter option enabled.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff5b912d",
   "metadata": {},
   "source": [
    "## Get group IDs\n",
    "\n",
    "This step calls the Graph APIs to get a few group IDs for your Microsoft Entra identity. Your group IDs will be added to the access control list of the objects created in the next step. Two group identifiers are retrieved. Each one is assigned to a different file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "329fe160",
   "metadata": {},
   "outputs": [],
   "source": [
    "from msgraph import GraphServiceClient\n",
    "client = GraphServiceClient(credentials=credential, scopes=[\"https://graph.microsoft.com/.default\"])\n",
    "\n",
    "groups = await client.me.member_of.get()\n",
    "first_group_id = groups.value[0].id\n",
    "second_group_id = groups.value[1].id"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20588dc3",
   "metadata": {},
   "source": [
    "## Upload sample directory and file\n",
    "\n",
    "This step creates the container, folders, and uploads the files into Azure Storage. It assigns your group IDs to to the access control list for each file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "acd28b29",
   "metadata": {},
   "outputs": [],
   "source": [
    "from azure.storage.filedatalake import DataLakeServiceClient\n",
    "import requests\n",
    "\n",
    "service = DataLakeServiceClient.from_connection_string(adls_gen2_connection_string, credential=credential)\n",
    "container = service.get_file_system_client(adls_gen2_container_name)\n",
    "if not container.exists():\n",
    "    container.create_file_system()\n",
    "root_dir_client = container.get_directory_client(\"/\")\n",
    "state_parks_dir_client = container.get_directory_client(\"state-parks\")\n",
    "state_parks_dir_client.create_directory()\n",
    "root_dir_client.update_access_control_recursive(f\"group:{first_group_id}:rwx\")\n",
    "root_dir_client.update_access_control_recursive(f\"group:{second_group_id}:rwx\")\n",
    "\n",
    "oregon_dir_client = state_parks_dir_client.create_sub_directory(\"oregon\")\n",
    "oregon_dir_client.create_directory()\n",
    "file_client = oregon_dir_client.create_file(\"oregon_state_parks.csv\")\n",
    "oregon_state_parks_content = requests.get(\"https://raw.githubusercontent.com/Azure-Samples/azure-search-sample-data/refs/heads/main/state-parks/Oregon/oregon_state_parks.csv\").content.decode(\"utf-8\")\n",
    "file_client.upload_data(oregon_state_parks_content, overwrite=True)\n",
    "oregon_dir_client.update_access_control_recursive(f\"group:{first_group_id}:rwx\")\n",
    "\n",
    "washington_dir_client = state_parks_dir_client.create_sub_directory(\"washington\")\n",
    "washington_dir_client.create_directory()\n",
    "file_client = washington_dir_client.create_file(\"washington_state_parks.csv\")\n",
    "washington_state_parks_content = requests.get(\"https://raw.githubusercontent.com/Azure-Samples/azure-search-sample-data/refs/heads/main/state-parks/Washington/washington_state_parks.csv\").content.decode(\"utf-8\")\n",
    "file_client.upload_data(washington_state_parks_content, overwrite=True)\n",
    "washington_dir_client.update_access_control_recursive(f\"group:{second_group_id}:rwx\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca6de2ad",
   "metadata": {},
   "source": [
    "## Run the indexer\n",
    "\n",
    "Start the indexer to run all operations, from data retrieval to indexing. Any connection errors or permission problems become evident here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ce7eb5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from azure.search.documents.indexes.models import SearchIndexer, FieldMapping\n",
    "\n",
    "indexer = SearchIndexer(\n",
    "    name=indexer_name,\n",
    "    target_index_name=index_name,\n",
    "    data_source_name=datasource_name,\n",
    "    field_mappings=[\n",
    "        FieldMapping(source_field_name=\"metadata_group_ids\", target_field_name=\"groups\"),\n",
    "        FieldMapping(source_field_name=\"metadata_user_ids\", target_field_name=\"oids\"),\n",
    "    ]\n",
    ")\n",
    "\n",
    "indexer_client.create_or_update_indexer(indexer)\n",
    "print(f\"Indexer '{indexer_name}' created\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "987dd496",
   "metadata": {},
   "source": [
    "## Search sample data using x-ms-query-source-authorization\n",
    "\n",
    "Wait for the indexer to finish processing before running queries. This query uses an empty search string (`*`) for an unqualified search. It returns the file name and permission metadata associated with each file. Notice that each file is associated with a different group ID."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a899da1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from azure.search.documents import SearchClient\n",
    "search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)\n",
    "\n",
    "results = search_client.search(search_text=\"*\", x_ms_query_source_authorization=token_provider(), select=\"metadata_storage_path,oids,groups\", order_by=\"id asc\")\n",
    "for result in results:\n",
    "    print(f\"Path: {result['metadata_storage_path']}, OID: {result['oids']}, Group: {result['groups']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c712ab8c",
   "metadata": {},
   "source": [
    "## Search sample data without x-ms-query-source-authorization \n",
    "\n",
    "This step demonstrates the user experience when authorization fails. No results are returned in the response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72d203f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from azure.search.documents import SearchClient\n",
    "search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)\n",
    "\n",
    "results = search_client.search(search_text=\"*\", x_ms_query_source_authorization=None, select=\"metadata_storage_path,oids,groups\", order_by=\"id asc\")\n",
    "for result in results:\n",
    "    print(f\"Path: {result['metadata_storage_path']}, OID: {result['oids']}, Group: {result['groups']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1ac3c84",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "\n",
    "To learn more, see [Document-level access control in Azure AI Search](https://learn.microsoft.com/azure/search/search-document-level-access-overview)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
